The main purpose of this project is to evaluate the impact of social and behavioral factors on COVID-19 transmissibility. In order to accomplish this, we looked at the ‘covid2020’ dataset which is a merged dataset composed of the ‘countryhealthstats’ dataset from the World Bank website, ‘diet2020’ dataset from Kaggle, and ‘totaltests’ dataset from the ‘Our World in Data’ website. The variables contained in the ‘covid2020’ dataset are the % of urban population, the % of unemployment, the % of rural population, the % of people above the age of 65, the % of people with access to proper sanitation, alcohol intake as a percentage of total calories in diet, the % of obesity,the % of population with confirmed covid cases, the % of population dead due to COVID-19, the population, and the total yearly tests for different countries around the world. This dataset also has 102 observations with each observation representing a unique country. The ‘covid2020’ dataset is particularly interesting to us because social and behavioral factors described in this dataset shape the everyday life of people and thus will no doubt impact the disease prevalence of diseases like COVID-19 around the world. Some interesting trends we expect are a positive correlation between the percentage of elderly population in a country and the death due to COVID-19 rate, and we also expect health conditions like obesity and undernourishment to be positively correlated with COVID-19 occurrence. The ‘countryhealthstats’ dataset was tidied before it merged with the other two datasets to create the ‘covid2020’ dataset. In the ‘countryhealthstats’ dataset, we moved the different variables from the ‘Series Name’ column to separate columns of their own using pivot_wider, thus tidying the dataset. In this project, we also explore the dataset using PCA and clustering to view the different ways in which the countries are grouped or related. We also import a new dataset which reports the HDI of each country from the World Population Review Website and compare to see if the country clusters match the development status of the countries. Finally, we use classification to see if we can use the COVID-19 related variables in the dataset to predict the development status of the country and then we evaluate that classification model to check if it has any signs of overfitting.
#Call all the relevant packages
library(tidyverse)
library(cluster)
library(factoextra)
library(GGally)
library(ggplot2)
library(plotROC)
library(caret)
library(psych)
# Call the 'covid2020' dataset
library(readr)
covid2020 <- read_csv("covid2020.csv")
#Remove the variable 'Handwash%' from the 'covid2020' dataset as it contains too many 'NA' values
#covid2020 <- covid2020 %>% select(-`Handwash%`, -Undernourished)
# Save a new dataset called 'covid2020num' containing only the numeric variables from the 'covid2020' dataset
covid2020num <- covid2020 %>% select(is.numeric)
# Make a correlation matrix with bivariate and univariate graphs for the 'covid2020num' dataset
pairs.panels(covid2020num,
method = "pearson", # correlation coefficient method
hist.col = "blue", # color of histogram
smooth = FALSE, density = FALSE, ellipses = FALSE, cex.labels=2, cex.axis=1.5)
The needed packages were called then our covid2020 dataset, used from our last project, was uploaded using the read_csv function. The covid2020 dataset included the country names and statistics about their population such as the urban population %, the confirmed cases, %, the amount of yearly tests, etc. We then selected only the numeric variables from our covid2020 dataset and saved it to covid2020num and made a correlation matrix with bivariate and univariate graphs for its respective numeric variables. Technically, the UrbanPop% and the RuralPop% were the most correlated with a correlation coefficient of -1.00 but this was expected as they are complete opposite statistics. The confirmed case and death percentage were the second most correlated with a correlation coefficient of about .83 which makes sense as the death percentage due to covid19 is sure to be in high correspondence among those who had caught the virus with the lethality of covid19 around the world. Unemployment% and yearly_tests as well as Alcohol% and yearly_tests were tied for the least correlated pairs of variables with correlation coefficients of 0.01 and -0.01 respectively. This makes sense as one would expect these variables to have no relationship with each other.
#Make a 'covid2020scaled' dataset containing only scaled numeric variables with no missing values from the 'covid2020num' dataset
covid2020scaled <- covid2020num %>% scale %>% as.data.frame() %>% na.omit
#Conduct a pca on the 'covid2020scaled' dataset
pca <- covid2020scaled %>% prcomp()
# Visualize percentage of variances for each PC in a scree plot
fviz_eig(pca, addlabels = TRUE, ylim = c(0, 70))
# Visualize the 5 top contributions of the variables to the PCs as a percentage
# Note the red dash line indicates the average contribution
fviz_contrib(pca, choice = "var", axes = 1, top = 5) # on PC1
fviz_contrib(pca, choice = "var", axes = 2, top = 5) # on PC2
# Visualize the contributions of the variables to the PCs in a table
get_pca_var(pca)$coord
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## UrbanPop% 0.8170718 -0.14462983 -0.3806190 -0.21671324 -0.002947619
## unemploy% 0.1548693 -0.10463804 -0.4650332 0.73889644 -0.404452165
## RuralPop% -0.8170718 0.14462983 0.3806190 0.21671324 0.002947619
## 65above% 0.7842379 0.11313034 0.3795558 -0.21724387 -0.227608461
## Sanitation% 0.8099983 0.01499984 -0.1601688 -0.21152843 -0.107625245
## Alcohol% 0.6207213 0.04661748 0.5426893 -0.08681380 -0.252557176
## Obesity 0.8403316 -0.15867501 -0.3156407 -0.01270116 0.191956365
## Confirmed% 0.7744871 0.19227217 0.2874412 0.38868273 0.141004360
## Death% 0.7151574 0.20610430 0.3131063 0.41094933 0.115949654
## Population -0.2049740 0.82385372 -0.2230074 -0.20322946 -0.399420518
## yearly_test 0.1518711 0.84705073 -0.2488066 0.06187560 0.359456007
## Dim.6 Dim.7 Dim.8 Dim.9 Dim.10
## UrbanPop% 0.12139866 -0.22401026 0.08160311 -0.055193464 0.030200243
## unemploy% 0.10965153 0.13637576 0.05511899 -0.014218089 0.004022261
## RuralPop% -0.12139866 0.22401026 -0.08160311 0.055193464 -0.030200243
## 65above% -0.02691828 0.19237524 0.32250542 0.050700522 -0.128714051
## Sanitation% -0.37677305 0.28912852 -0.11189648 -0.108366840 0.143161087
## Alcohol% 0.40478463 0.05007399 -0.21859071 -0.006788501 0.083001770
## Obesity 0.01943614 0.14086609 -0.21222879 0.237516604 -0.149680619
## Confirmed% -0.12286424 -0.14551136 -0.10619581 -0.239396117 -0.151565069
## Death% -0.17146028 -0.21204544 0.07832926 0.212131965 0.149452969
## Population -0.12323793 -0.20433772 -0.11483954 0.069828031 -0.059977823
## yearly_test 0.23610434 0.21653469 0.08259900 -0.045781256 0.049232654
## Dim.11
## UrbanPop% 2.016544e-16
## unemploy% 0.000000e+00
## RuralPop% 2.016544e-16
## 65above% 7.915399e-33
## Sanitation% 1.583080e-32
## Alcohol% 3.957700e-32
## Obesity 3.166160e-32
## Confirmed% 3.957700e-33
## Death% 3.957700e-32
## Population 5.936550e-33
## yearly_test 1.088367e-32
# Visualize the contributions of the variables to the PCs in a correlation circle
fviz_pca_var(pca, col.var = "black",
repel = TRUE) # Avoid text overlapping
# Visualize the individuals according to PC1 and PC2
fviz_pca_ind(pca,
geom.ind = "point", # show points only (nbut not "text")
palette = c("#00AFBB", "#E7B800", "#FC4E07"),
addEllipses = TRUE, # Concentration ellipses
legend.title = "outcome"
)
#Total Variation Explained by the 2 PCs
44.4+14.2
## [1] 58.6
In the code chunk, we first scaled the ‘covid2020num’ dataset and removed missing values to ready it for PCA analysis. Then, we conducted PCA on the scaled dataset. We visualized the percentage of variance explained by each PC. The first PC explained about 44.4% of the variation, the second PC explained about 14.2% of the variation, the third PC explains about 12.3% of the variation, and the fourth PC explains about 10% of the variation. Technically, we would need to include 4 dimensions to have at least 80% variance explained, but for visualization purposes we will be sticking 2 principal components since that is easy to visualize. We then visualized which were the top 5 variables contributing to each of the 2 principal components. The top 5 contributing variables to the first principle component are ‘Obesity’, ‘RuralPop%’, ‘UrbanPop%’, ‘Sanitation%’, and ‘65above%’, and all these 5 variables contributed around an equal amount to PC1 ranging from 10-15% contribution per variable. The top 5 contributing variables to the second principle component are ‘yearly_test’, ‘population’, ‘Death%’, ‘Confirmed%’, and ‘Obesity’, and here only ‘yearly_test’ and ‘Population’ contributed significantly to PC2 with each variable contributing around 40%, while the other 3 variables each had contributions below 5% to PC2. Based on the values of how each variable contributed to each PC, we determined what it would mean to score high in each PC. If a country were to score high in PC1, it would mean that the country has a high urban population %, a moderately high unemployment rate, a low rural population %, a high population of people above 65 years old, a high availability of sanitation facilities, a relatively high alcohol consumption %, a high obesity rate, a high % of COVID-19 confirmed cases, a high percentage of COVID deaths, a moderately low population, and moderate amount of COVID-19 tests conducted yearly compared to other countries. If a country were to score high in PC2, it would mean that they have a moderately low urban population %, moderately low employment %, moderately high rural population %, moderately high percentage of people above the 65, moderate availability of sanitation facilities, moderate consumption of alcoholic beverages, moderately high % of COVID-19 confirmed cases and deaths, very high population, and very high amount of COVID-19 tests conducted yearly. We then visualized the PCs in a correlation circle and in terms of where the individual countries lay. Overall, the 2 PCs displayed on the graph explained about 58.6% of the total variation.
# Prepare the data (drop the categorical variable 'Country' because it has too many categories) for Gower dissimilarities
covid2020gow <- covid2020 %>%
select(-Country)
# Apply 'gower' metric to the 'covid2020gow' dataset and save it as the matrix 'covid2020_gower'
covid2020_gower <- daisy(covid2020gow, metric = "gower") %>%
# Save as a matrix
as.matrix
# Save an object 'test' which looks at the distances between pairs of countries
test <- covid2020_gower %>%
# Save 'covid2020_gower' as a dataframe
as.data.frame %>%
# ID each row
rownames_to_column("country1") %>%
# Cross the ID of the country
pivot_longer(-1, names_to = "country2", values_to = "distance") %>%
# Get rid of pairs of the same country
filter(country1 != country2) %>%
# Avoid having the same pairs
distinct(distance, .keep_all = TRUE)
#View the distances between the country pairs
test
## # A tibble: 5,151 × 3
## country1 country2 distance
## <chr> <chr> <dbl>
## 1 1 2 0.221
## 2 1 3 0.170
## 3 1 4 0.111
## 4 1 5 0.178
## 5 1 6 0.121
## 6 1 7 0.282
## 7 1 8 0.142
## 8 1 9 0.251
## 9 1 10 0.135
## 10 1 11 0.176
## # … with 5,141 more rows
#Determine the number of optimum clusters to run for 'pam'
fviz_nbclust(covid2020_gower, pam, method = "silhouette")
#Run pam clustering for 'covid2020_gower' with 2 clusters
pam_results <- pam(covid2020_gower, k = 2, diss = TRUE)
#Have a look at the pam_results
pam_results
## Medoids:
## ID
## [1,] "96" "96"
## [2,] "63" "63"
## Clustering vector:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 1 2 1 1 1 1 2 1 1 1 1 1 1 1 1 1 2 1 1 1
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 2 1 1 2
## 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 1 1 1 1 1 1 1 1 2 1 1 1 1 2 2 2 2 1 2 1
## 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 1 1 2 2 2 2 1 1 2 1 1 2 1 1 1 2 1 1 1 2
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
## 1 2 1 1 1 1 2 1 1 1 2 2 1 1 2 1 1 1 1 1
## 101 102
## 2 2
## Objective function:
## build swap
## 0.1297519 0.1249905
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call"
#Determine the countries at the centers of each cluster
covid2020[96,]
## # A tibble: 1 × 12
## Country `UrbanPop%` `unemploy%` `RuralPop%` `65above%` `Sanitation%`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Ukraine 69.6 9.13 30.4 16.9 97.7
## # … with 6 more variables: `Alcohol%` <dbl>, Obesity <dbl>, `Confirmed%` <dbl>,
## # `Death%` <dbl>, Population <dbl>, yearly_test <dbl>
covid2020[63,]
## # A tibble: 1 × 12
## Country `UrbanPop%` `unemploy%` `RuralPop%` `65above%` `Sanitation%`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mozambique 37.1 3.81 62.9 2.86 37.2
## # … with 6 more variables: `Alcohol%` <dbl>, Obesity <dbl>, `Confirmed%` <dbl>,
## # `Death%` <dbl>, Population <dbl>, yearly_test <dbl>
#Determine the silhouette width for running pam with 2 clusters
pam_results$silinfo$avg.width
## [1] 0.4362948
#Now add the clustering results to the 'covid2020gow' dataset and overwrite that dataset as 'covid_pam'
covid_pam <- covid2020gow %>%
mutate(cluster = as.factor(pam_results$clustering)) %>% na.omit
First, we selected all variables besides the country variable and saved it as a new dataset called covid2020gow. We then calculated gower’s distance between all the observations with the daisy function and saved it as a matrix named ‘covid2020_gower’. Then with this dataset, using dplyr, we found the gower distance between each country and saved it as a new dataset called ‘test’. Then, fviz_nbclust was used to find that the optimal number of clusters needed for our covid2020_gower dataset if pam clustering was run on it. Based on the fviz_nbclust, 2 clusters were done for pam because 2 clusters had the highest average silhouette width on this graph. With this, PAM was performed on our covid2020_gower dataset and saved to pam_results. Within pam_results, the center for cluster 1 was found to be Ukraine and the center of cluster 2 was found to be Mozambique. The silhouette width for running pam with 2 clusters was found to be about .436 which indcated that the structure was weak and could be artificial. The clustering results were then added to the covid2020gow dataset and saved to the covid_pam dataset.
# Visualize the clusters by showing all pairwise combinations of variables colored by cluster assignment
ggpairs(covid_pam, columns = 1:11, aes(color = cluster))
We visualized the clusters by showing all pairwise combinations of variables colored by clusters from the covid_pam dataset. The visualization shows the values for overall correlation and cluster-specific correlation between pairs of variables. The visualization also shows the distribution of values in the different clusters for 2 variables at a time in pairwise graphs.
##Clustering and PCA
# Import the HDI dataset for different countries around the world
HDI <- read_csv("HDI.csv")
# Inner join the covid2020 dataset with the HDI dataset and call the merged dataset 'covid2020hdi'
covid2020hdi <- covid2020 %>%
inner_join(HDI, by = c("Country" = "country")) %>%
#Remove the variable 'pop2022' from the merged dataset
select(!pop2022) %>%
# Add a variable named 'development' which has the value of 'developed' if HDI > 0.7 or the value of 'developing' if HDI < 0.7
mutate(development = ifelse(hdi > .7, "developed", NA)) %>%
mutate(development = ifelse(is.na(development) & hdi < .7, "developing", development))
# Remove NA values from the 'covid2020hdi' dataset
covid2020hdi <- covid2020hdi %>%
na.omit()
# Determine the number of optimum clusters for pam on the 'covid2020scaled' dataset
fviz_nbclust(covid2020scaled, pam, method="silhouette")
# Run pam with 2 clusters for the 'covid2020scaled' dataset
pam_results2 <- covid2020scaled %>%
pam(k = 2)
#Have a look at the pam_results2
pam_results2
## Medoids:
## ID UrbanPop% unemploy% RuralPop% 65above% Sanitation% Alcohol%
## 96 91 0.212998 0.2345297 -0.212998 0.7433827 0.5828775 -0.0143861
## 63 59 -1.354486 -0.7760267 1.354486 -1.2682073 -1.9196184 -1.0260110
## Obesity Confirmed% Death% Population yearly_test
## 96 0.618261 0.1490423 0.1048607 -0.09296293 -0.1377675
## 63 -1.600466 -1.0446309 -0.9795512 -0.14656258 -0.2700055
## Clustering vector:
## 1 4 5 6 7 8 9 10 12 13 14 15 16 17 18 19 20 21 22 23
## 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
## 1 1 1 1 1 2 1 1 1 1 1 2 1 2 1 1 2 1 1 1
## 44 45 46 47 48 49 50 51 52 53 54 55 57 58 59 60 61 62 63 65
## 1 1 1 1 1 2 1 1 1 1 2 2 2 1 2 1 1 1 2 2
## 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85
## 2 1 1 2 1 1 2 1 1 1 2 1 1 1 2 1 2 1 1 1
## 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102
## 1 2 1 1 1 1 2 1 1 2 1 1 1 1 1 2 2
## Objective function:
## build swap
## 2.440543 2.440543
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call" "data"
#Determine the countries at the centers of each cluster
covid2020[96,]
## # A tibble: 1 × 12
## Country `UrbanPop%` `unemploy%` `RuralPop%` `65above%` `Sanitation%`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Ukraine 69.6 9.13 30.4 16.9 97.7
## # … with 6 more variables: `Alcohol%` <dbl>, Obesity <dbl>, `Confirmed%` <dbl>,
## # `Death%` <dbl>, Population <dbl>, yearly_test <dbl>
covid2020[63,]
## # A tibble: 1 × 12
## Country `UrbanPop%` `unemploy%` `RuralPop%` `65above%` `Sanitation%`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mozambique 37.1 3.81 62.9 2.86 37.2
## # … with 6 more variables: `Alcohol%` <dbl>, Obesity <dbl>, `Confirmed%` <dbl>,
## # `Death%` <dbl>, Population <dbl>, yearly_test <dbl>
# Add the clustering results to the 'covid2020scaled' dataset
covidscaled_pam <- covid2020scaled %>%
mutate(cluster = as.factor(pam_results2$clustering))
# Visualize the clustering results on a PCA graph making sure to show the development status of the countries in each cluster
fviz_cluster(pam_results2, data = covid2020scaled,
shape = as.factor(covid2020hdi$development)) +
geom_point(aes(shape = as.factor(covid2020hdi$development))) +
guides(shape = guide_legend(title = "shape"))
# Table showing the relation of clusters made to the development status of the countries
table(covidscaled_pam$cluster, covid2020hdi$development)
##
## developed developing
## 1 71 3
## 2 3 20
#Calculate accuracy
(71+20)/97
## [1] 0.9381443
For this code chunk, we hoped to see how well our clusters matched the development status groups for the countries. To do this, we first imported the HDI dataset for different countries around the world. We then inner joined our ‘covid2020’ dataset with the HDI dataset by country names and kept all variables besides the pop2022 variable from the HDI dataset. We mutated the dataset by creating another variable called “development” in which countries with an hdi (human development index) score above .7 were classified as developed while countries with an hdi score below .7 were classified as developing. We then overwrote our ‘covid2020hdi’ dataset to exclude NA values. We utilized fvis_nbclust on our ‘covid2020scaled’ dataset from the PCA chunk of code above and found the optimal number of clusters needed for this scaled dataset was 2 as it had the highest average silhouette width. We then performed PAM on our ‘covid2020scaled’ dataset with a k value of 2 and the results were saved to pam_results2. Within pam_results2, the center for cluster 1 was found to be Ukraine and the center of cluster 2 was found to be Mozambique. We then mutated our ‘covid2020’ scaled dataset to include the clustering found from pam_results2 and called the variable cluster. We then visualized our clustering against our development variable using the fviz_cluster function in which the shapes represented the development variable while the color represented the cluster for each observation. We then used the table function to find the similarities between our development variable and clustering from our PAM and found that our clustering was about 93.8% accurate in determining the development status of each country.
##Classification
#For each country, assign the 'developed' status a value of '1' and 'developing' status a value of 0
covid2020hdi <- covid2020hdi %>% mutate (actual = ifelse(hdi> 0.7, 1, 0))
# Add a new variable called 'positivityrate' which is a function of the variables 'Confirmed%', 'Population', 'yearly_test'
covid2020hdi <- covid2020hdi %>% mutate(positivityrate = (( `Confirmed%`/100)*Population*100)/(yearly_test))
# Use a glm model to create a fit which shows development status based on 'positivityrate', 'Confirmed%', and 'Death%'
fit <- glm(actual ~ positivityrate + `Confirmed%` + `Death%`, data = covid2020hdi, family = "binomial")
summary(fit)
##
## Call:
## glm(formula = actual ~ positivityrate + `Confirmed%` + `Death%`,
## family = "binomial", data = covid2020hdi)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.91122 0.00326 0.08566 0.42729 1.53137
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.8032 0.4303 -1.867 0.0620 .
## positivityrate -0.2896 0.8479 -0.342 0.7327
## `Confirmed%` 1.4288 0.5893 2.424 0.0153 *
## `Death%` 7.3647 20.5058 0.359 0.7195
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 106.260 on 96 degrees of freedom
## Residual deviance: 62.763 on 93 degrees of freedom
## AIC: 70.763
##
## Number of Fisher Scoring iterations: 7
# Calculate a predicted probability based on the fit
log_covid2020hdi <- covid2020hdi %>%
mutate(score = predict(fit, type = "response"),
predicted = ifelse(score < 0.5, 0, 1))
log_covid2020hdi
## # A tibble: 97 × 18
## Country `UrbanPop%` `unemploy%` `RuralPop%` `65above%` `Sanitation%`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Albania 62.1 13.3 37.9 14.7 99.3
## 2 Armenia 63.3 21.2 36.7 11.8 93.9
## 3 Australia 86.2 6.46 13.8 16.2 100.
## 4 Austria 58.7 5.36 41.3 19.2 100.
## 5 Bangladesh 38.2 5.41 61.8 5.23 54.2
## 6 Belarus 79.5 4.77 20.5 15.6 97.9
## 7 Belgium 98.1 5.55 1.92 19.3 99.5
## 8 Bolivia 70.1 7.90 29.9 7.49 65.8
## 9 Botswana 70.9 24.9 29.1 4.51 80.0
## 10 Brazil 87.1 13.7 12.9 9.59 90.1
## 11 Bulgaria 75.7 5.12 24.3 21.5 86.0
## 12 Canada 81.6 9.46 18.4 18.1 99.0
## 13 Chile 87.7 11.2 12.3 12.2 100.
## 14 China 61.4 5 38.6 12.0 92.4
## 15 Colombia 81.4 15.0 18.6 9.06 93.7
## 16 Costa Rica 80.8 17.4 19.2 10.3 97.9
## 17 Croatia 57.6 7.51 42.4 21.3 96.6
## 18 Cuba 77.2 2.62 22.8 15.9 91.4
## 19 Cyprus 66.8 7.59 33.2 14.4 99.4
## 20 Czech Republic 74.1 2.55 25.9 20.1 99.1
## 21 Denmark 88.1 5.64 11.9 20.2 99.6
## 22 Dominican Repub… 82.5 6.13 17.5 7.53 87.2
## 23 Ecuador 64.2 6.11 35.8 7.59 91.5
## 24 El Salvador 73.4 6.25 26.6 8.65 82.4
## 25 Estonia 69.2 6.80 30.8 20.4 99.1
## 26 Ethiopia 21.7 3.24 78.3 3.54 8.91
## 27 Fiji 57.2 4.72 42.8 5.82 99.2
## 28 Finland 85.5 7.76 14.5 22.6 99.4
## 29 France 81.0 8.01 19.0 20.8 98.6
## 30 Georgia 59.5 18.5 40.5 15.3 85.8
## 31 Germany 77.5 3.81 22.5 21.7 99.2
## 32 Ghana 57.3 4.65 42.7 3.14 23.7
## 33 Greece 79.7 16.3 20.3 22.3 99.0
## 34 Guatemala 51.8 3.55 48.2 5.04 67.9
## 35 Hungary 71.9 4.25 28.1 20.2 98.0
## 36 Iceland 93.9 5.48 6.10 15.6 98.8
## 37 India 34.9 8.00 65.1 6.57 71.3
## 38 Iraq 70.9 14.1 29.1 3.44 100.
## 39 Ireland 63.7 5.62 36.3 14.6 91.3
## 40 Israel 92.6 4.33 7.41 12.4 99.9
## 41 Italy 71.0 9.16 29.0 23.3 99.9
## 42 Jamaica 56.3 9.48 43.7 9.08 86.6
## 43 Japan 91.8 2.80 8.22 28.4 99.9
## 44 Jordan 91.4 19.0 8.58 3.95 97.1
## 45 Kazakhstan 57.7 4.89 42.3 7.90 97.9
## 46 Kenya 28.0 5.73 72.0 2.51 32.7
## 47 Kuwait 100 3.54 0 3.04 100
## 48 Latvia 68.3 8.10 31.7 20.7 92.4
## 49 Lithuania 68.0 8.49 32.0 20.6 93.9
## 50 Luxembourg 91.5 6.77 8.55 14.4 97.6
## 51 Madagascar 38.5 2.47 61.5 3.10 12.3
## 52 Malawi 17.4 6.70 82.6 2.64 26.6
## 53 Maldives 40.7 6.33 59.3 3.59 99.2
## 54 Malta 94.7 4.26 5.26 21.3 100.
## 55 Mauritania 55.3 11.3 44.7 3.18 49.8
## 56 Mexico 80.7 4.45 19.3 7.62 92.4
## 57 Mongolia 68.7 7.01 31.3 4.31 67.7
## 58 Morocco 63.5 11.5 36.5 7.61 87.3
## 59 Mozambique 37.1 3.81 62.9 2.86 37.2
## 60 Namibia 52.0 21.4 48.0 3.59 35.3
## 61 Nepal 20.6 4.72 79.4 5.83 76.6
## 62 Netherlands 92.2 3.82 7.76 20.0 97.7
## 63 New Zealand 86.7 4.59 13.3 16.4 100.
## 64 Nigeria 52.0 9.71 48.0 2.74 42.7
## 65 North Macedonia 58.5 17.2 41.5 14.5 98.3
## 66 Norway 83.0 4.42 17.0 17.5 98.1
## 67 Pakistan 37.2 4.30 62.8 4.35 68.4
## 68 Panama 68.4 12.9 31.6 8.54 84.6
## 69 Paraguay 62.2 7.55 37.8 6.81 92.7
## 70 Peru 78.3 7.18 21.7 8.73 78.6
## 71 Philippines 47.4 2.52 52.6 5.51 82.3
## 72 Poland 60.0 3.16 40.0 18.7 100.
## 73 Portugal 66.3 6.79 33.7 22.8 99.6
## 74 Romania 54.2 5.03 45.8 19.2 87.1
## 75 Rwanda 17.4 1.49 82.6 3.12 68.8
## 76 Saudi Arabia 84.3 7.45 15.7 3.50 100
## 77 Senegal 48.1 3.62 51.9 3.11 56.8
## 78 Serbia 56.4 9.01 43.6 19.1 97.9
## 79 Slovenia 55.1 4.97 44.9 20.7 98.1
## 80 South Africa 67.4 29.2 32.6 5.51 78.5
## 81 Spain 80.8 15.5 19.2 20.0 99.9
## 82 Sri Lanka 18.7 5.88 81.3 11.2 93.7
## 83 Suriname 66.1 9.78 33.9 7.13 90.0
## 84 Sweden 88.0 8.29 12.0 20.3 99.3
## 85 Switzerland 73.9 4.82 26.1 19.1 99.9
## 86 Thailand 51.4 1.10 48.6 13.0 98.7
## 87 Togo 42.8 3.94 57.2 2.91 18.6
## 88 Trinidad and To… 53.2 4.57 46.8 11.5 93.9
## 89 Turkey 76.1 13.1 23.9 8.98 99.2
## 90 Uganda 25.0 2.77 75.0 1.99 19.8
## 91 Ukraine 69.6 9.13 30.4 16.9 97.7
## 92 United Arab Emi… 87.0 3.19 13.0 1.26 99.2
## 93 United Kingdom 83.9 4.47 16.1 18.7 99.1
## 94 United States 82.7 8.05 17.3 16.6 99.7
## 95 Uruguay 95.5 10.4 4.49 15.1 98.1
## 96 Zambia 44.6 12.8 55.4 2.13 31.9
## 97 Zimbabwe 32.2 5.35 67.8 3.01 35.2
## # … with 12 more variables: `Alcohol%` <dbl>, Obesity <dbl>,
## # `Confirmed%` <dbl>, `Death%` <dbl>, Population <dbl>, yearly_test <dbl>,
## # hdi <dbl>, development <chr>, actual <dbl>, positivityrate <dbl>,
## # score <dbl>, predicted <dbl>
# Confusion matrix: compare true to predicted condition
table(log_covid2020hdi$actual, log_covid2020hdi$predicted) %>% addmargins
##
## 0 1 Sum
## 0 16 7 23
## 1 10 64 74
## Sum 26 71 97
# Calculate accuracy based on confusion matric
80/97
## [1] 0.8247423
# Visualize a ROC curve for the glm model
ROC <- ggplot(log_covid2020hdi) +
geom_roc(aes(d = actual, m = score), n.cuts = 0)
ROC
# Calculate the area under the curve for the ROC model
calc_auc(ROC)
## PANEL group AUC
## 1 1 -1 0.8830787
The covid2020hdi dataset was mutated by adding a new variable called “actual” in which countries with an hdi scores above .7, or developed countries, were given a value of 1 while countries with an hdi score below .7 (developing countries) were given a value of 0. Also, a new variable called positivityrate was added to the dataset in which positivityrate was a function of confirmed%, population, and yearly_test. We then used the glm function on our covid2020hdi data to create a fit which demonstrated the development status based on positivityrate, confirmed%, and death%. This fit was then summarized and displayed. We then calculated a predicted probability based on the fit and saved it to log_covid2020hdi. We then used the table function to compare the predicted and true values. The accuracy was found to be about 82.5%. We then visualized a ROC curve for the glm model and saved it to ROC and found the area under the curve of the ROC model to be about 0.883 which indicated that the model was a good fit.
##Cross-validation with the ‘k-fold’ method
# Choose number of folds
k = 10
# Randomly order rows in the dataset
data <- covid2020hdi[sample(nrow(covid2020hdi)), ]
#Set Seed to save results
set.seed(11)
# Create k folds from the dataset
folds <- cut(seq(1:nrow(data)), breaks = k, labels = FALSE)
# Use a for loop to get diagnostics for each test set
diags_k <- NULL
for(i in 1:k){
# Create training and test sets
train <- data[folds != i, ] # all observations except in fold i
test <- data[folds == i, ] # observations in fold i
# Train model on training set (all but fold i)
fit <- glm(actual ~ positivityrate + `Confirmed%` + `Death%`, data = train, family = "binomial")
# Test model on test set (fold i)
df <- data.frame(
probability = predict(fit, newdata = test, type = "response"),
actual = test$actual)
# Consider the ROC curve for the test dataset
ROC <- ggplot(df) +
geom_roc(aes(d = actual, m = probability, n.cuts = 0))
# Get diagnostics for fold i (AUC)
diags_k[i] <- calc_auc(ROC)$AUC
}
# Average performance
mean(diags_k)
## [1] 0.8731151
The k-fold cross-validation method was used to see if there were any signs of overfitting. The code chunk above cut the data into 10 folds in which 9 folds were used as the training set and the other 1 fold was used as the test set, this process was repeated till each fold was at least used once as the test set. The average performance obtained from the k-folds method indicated an average area the curve of 0.881 which is very close to the area under the curve of the original model which was 0.883. Thus, the model for the entire dataset does not show any signs of over-fitting as the average AUC of the k-folds cross-validation method almost matched the AUC of the original model, indicating that the model would remain a good fit even if any new data is added.
‘Health Nutrition and Population Statistics’ (HealthStats.csv): Website Link
Context: This dataset was obtained from World Bank, and this dataset contained many population and health statistics for countries around the world.
‘diet2020’ (Food_Supply_kcal_Data.csv): Website Link
Context: This dataset was obtained from Kaggle, and this dataset contained many dietary intake variables, health conditions associated with nutrition, and COVID-19 related data for countries around the world.
‘totaltests’ (total-tests-for-covid-19.csv): Website Link
Context: This dataset was obtained from the Our World in Data website and contained the total COVID-19 tests conducted daily in many countries around the world
‘HDI’ (HDI.csv): Website Link
Context: The dataset shows the HDI values and population values of different countries around the world